Use version-aware silo status checks #10091
Closed
ReubenBond wants to merge 7 commits into
This was referenced May 12, 2026
Contributor
Pull request overview
This PR updates Orleans grain-directory and locator logic to use version-aware silo status checks, treating entries as invalid only when the referenced silo is Dead (or provably stale/unknown relative to a newer membership snapshot), while keeping ShuttingDown/Stopping registrations valid until death.
Changes:
- Add `ClusterMembershipSnapshot.GetSiloStatus(SiloAddress, MembershipVersion)` and switch directory/cache/handoff validation to use it for dead-only filtering with version awareness.
- Rework `LocalGrainDirectory` membership processing to apply `IClusterMembershipService` snapshots and use dead-only invalidation for directory entries and cache.
- Add/extend tests to cover version-aware "unknown vs dead" semantics and terminating-but-not-dead behavior.
Show a summary per file
| File | Description |
|---|---|
| test/Orleans.Runtime.Internal.Tests/LocalGrainDirectoryTests.cs | Adds unit tests for LocalGrainDirectory.IsDefunctActivation version-aware semantics. |
| test/Orleans.Runtime.Internal.Tests/GrainDirectoryPartitionTests.cs | Switches partition tests to IClusterMembershipService and adds coverage for terminating-but-not-dead behavior. |
| test/Orleans.Runtime.Internal.Tests/GrainDirectoryHandoffManagerTests.cs | Adds tests ensuring handoff transferability is dead-only and version-aware. |
| test/Orleans.Runtime.Internal.Tests/ClusterMembershipSnapshotTests.cs | Adds tests for GetSiloStatus(silo, seenAtVersion) semantics (unknown/older => Dead). |
| test/Orleans.Core.Tests/Directory/CachedGrainLocatorTests.cs | Updates wiring to include membership service and adds cache validation tests for ShuttingDown/Stopping silos. |
| src/Orleans.Runtime/Networking/SiloConnectionMaintainer.cs | Breaks outstanding messages to Dead silos and closes connections on death. |
| src/Orleans.Runtime/MembershipService/SiloStatusListenerManager.cs | Minor type change (sealed). |
| src/Orleans.Runtime/MembershipService/ClusterMembershipSnapshot.cs | Introduces the version-aware GetSiloStatus overload. |
| src/Orleans.Runtime/GrainDirectory/LocalGrainDirectoryPartition.cs | Replaces terminating-based checks with snapshot-based dead-only defunct detection. |
| src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs | Moves to snapshot-driven membership application; dead-only invalidation; refreshes membership when entries are newer than applied snapshot. |
| src/Orleans.Runtime/GrainDirectory/ILocalGrainDirectory.cs | Removes IsSiloInCluster from the internal interface. |
| src/Orleans.Runtime/GrainDirectory/GrainDirectoryPartition.Interface.cs | Uses version-aware dead detection when deciding whether an entry is dead. |
| src/Orleans.Runtime/GrainDirectory/GrainDirectoryHandoffManager.cs | Uses snapshot dead-only filtering for transferable registrations; changes pending-operation retry behavior. |
| src/Orleans.Runtime/GrainDirectory/CachedGrainLocator.cs | Changes proactive cleanup to dead-only and uses version-aware status checks for cached entries. |
| src/Orleans.Runtime/Catalog/Catalog.cs | Removes directory-owned silo-status-change deactivation logic (moved elsewhere). |
| src/api/Orleans.Runtime/Orleans.Runtime.cs | Updates generated public API surface for the new GetSiloStatus overload. |
Copilot's findings
Comments suppressed due to low confidence (2)
src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs:677
`UnregisterAsync`: the retry-delay/recheck gate uses `hopCount > 1`, so the first forwarded request (`hopCount == 1`) will forward again without delay/revalidation. This is inconsistent with `LookupAsync`/`DeleteGrainAsync` (`hopCount > 0`) and may reintroduce fast hop chains when directory ownership is unstable. Consider using `hopCount > 0` (or otherwise aligning the hop-count semantics) so forwarded unregisters also get a stabilization delay.
    await RefreshMembershipIfNewer(address.MembershipVersion);
    // see if the owner is somewhere else (returns null if we are owner)
    var forwardAddress = this.CheckIfShouldForward(address.GrainId, hopCount, "UnregisterAsync");
    // After the first forward, we insert a retry delay and recheck owner before forwarding again
    if (hopCount > 1 && forwardAddress != null)
    {
        await Task.Delay(RETRY_DELAY);
        forwardAddress = this.CheckIfShouldForward(address.GrainId, hopCount, "UnregisterAsync");
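The alignment suggested in this finding could look roughly like the sketch below. It is not the Orleans implementation: `ForwardGateSketch`, `ResolveWithStabilizationAsync`, and the `resolveForwardTarget` delegate are hypothetical names standing in for `CheckIfShouldForward`, and the only point illustrated is the `hopCount > 0` gate.

```csharp
#nullable enable
using System;
using System.Threading.Tasks;

// Hypothetical sketch; none of these names exist in Orleans.
public static class ForwardGateSketch
{
    // resolveForwardTarget returns null when the local silo owns the entry,
    // otherwise the address the request should be forwarded to.
    public static async Task<string?> ResolveWithStabilizationAsync(
        int hopCount,
        Func<string?> resolveForwardTarget,
        TimeSpan retryDelay)
    {
        var target = resolveForwardTarget();

        // Any request that has already been forwarded at least once waits and
        // rechecks ownership before being forwarded again.
        if (hopCount > 0 && target != null)
        {
            await Task.Delay(retryDelay);
            target = resolveForwardTarget();
        }

        return target;
    }
}
```

With this gate a request arriving with `hopCount == 1` resolves the owner twice (once before and once after the delay), while a fresh request with `hopCount == 0` resolves it once and forwards immediately.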
src/Orleans.Runtime/GrainDirectory/CachedGrainLocator.cs:190
`ListenToClusterChange` computes `snapshot.CreateUpdate(previousSnapshot)` but never updates `previousSnapshot` inside the loop. As a result, every iteration diffs against the initial snapshot, which can repeatedly re-process the same dead silos and grow the change set over time. Updating `previousSnapshot = snapshot` at the end of each loop will make the processing incremental and avoid redundant `UnregisterSilos` calls.
    var previousSnapshot = this.clusterMembershipService.CurrentSnapshot;
    ((ITestAccessor)this).LastMembershipVersion = previousSnapshot.Version;
    var updates = this.clusterMembershipService.MembershipUpdates.WithCancellation(this.shutdownToken.Token);
    await foreach (var snapshot in updates)
    {
        // Active filtering: detect dead silos and try to clean proactively the directory
        var changes = snapshot.CreateUpdate(previousSnapshot).Changes;
        var deadSilos = changes
            .Where(member => member.Status == SiloStatus.Dead)
            .Select(member => member.SiloAddress)
            .ToList();
        if (deadSilos.Count > 0)
        {
            var tasks = new List<Task>();
            foreach (var directory in this.grainDirectoryResolver.Directories)
            {
                tasks.Add(directory.UnregisterSilos(deadSilos));
            }
            await Task.WhenAll(tasks).WaitAsync(this.shutdownToken.Token);
        }
        ((ITestAccessor)this).LastMembershipVersion = snapshot.Version;
    }
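The incremental processing described in this finding can be illustrated with a small self-contained sketch. It does not use Orleans types: a snapshot is modelled here as a hypothetical `Dictionary<string, bool>` mapping a silo name to whether it is dead, and `IncrementalDeadSiloSketch` is an invented name. The one behavior it shows is advancing the baseline after each pass so that each dead silo is reported once.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch; snapshots are modelled as silo name -> isDead.
public static class IncrementalDeadSiloSketch
{
    // Silos that are dead in `current` but were not already dead in `previous`.
    public static List<string> NewlyDead(
        Dictionary<string, bool> previous,
        Dictionary<string, bool> current)
    {
        return current
            .Where(kv => kv.Value && !(previous.TryGetValue(kv.Key, out var wasDead) && wasDead))
            .Select(kv => kv.Key)
            .OrderBy(name => name, StringComparer.Ordinal)
            .ToList();
    }

    // Processes snapshots in order and returns one batch per snapshot that
    // introduced newly dead silos.
    public static List<List<string>> Process(
        Dictionary<string, bool> initial,
        IEnumerable<Dictionary<string, bool>> updates)
    {
        var previous = initial;
        var batches = new List<List<string>>();
        foreach (var snapshot in updates)
        {
            var dead = NewlyDead(previous, snapshot);
            if (dead.Count > 0)
            {
                batches.Add(dead);
            }

            // Advance the baseline so the next pass only sees new changes.
            previous = snapshot;
        }

        return batches;
    }
}
```

If the `previous = snapshot` assignment were dropped, a silo that died in the first update would appear again in every later batch, which is the redundant cleanup the comment warns about.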
- Files reviewed: 16/16 changed files
- Comments generated: 2
77123df to c845b97
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When background Orleans work logs after xUnit has cleared the current test context, ITestOutputHelper throws InvalidOperationException with "There is no currently active test." That exception can escape through Microsoft.Extensions.Logging and abort the test host. Fall back to stderr for that specific late-log case so the original runtime log is still emitted without crashing the test process. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Use ClusterMembershipSnapshot.GetSiloStatus with registration membership versions when deciding whether grain directory entries are dead. This keeps shutting down and stopping silos valid until they are marked dead while still filtering old unknown or replaced silos. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Advance the previous membership snapshot after each cache cleanup pass and stamp manually cached entries with the current membership version so version-aware cache validation does not evict them immediately. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
c845b97 to 6cc9319
Part 6 of 7 split from #10085.
Problem:
Directory and cache validation need to treat entries as invalid only when the referenced silo is Dead, or when the silo is unknown in a snapshot newer than the entry's membership version. ShuttingDown and Stopping silos should remain valid until they are Dead.
Solution:
Add ClusterMembershipSnapshot.GetSiloStatus(silo, seenAtVersion), update LocalGrainDirectory, handoff, cached locator, and partition validation to use it, and add coverage for the version-aware semantics.
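For reviewers, the intended semantics can be summarized with a minimal sketch. It is an illustration only, not the code in this PR: `SketchSiloStatus`, `SketchMembershipSnapshot`, and `IsEntryValid` are hypothetical stand-ins, silos are identified by plain strings, versions are plain `long` values, and returning `None` for a silo that is unknown in a snapshot that is not newer than the entry is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins; these are not the Orleans types.
public enum SketchSiloStatus { None, Active, ShuttingDown, Stopping, Dead }

public sealed class SketchMembershipSnapshot
{
    private readonly Dictionary<string, SketchSiloStatus> members;

    public SketchMembershipSnapshot(long version, Dictionary<string, SketchSiloStatus> members)
    {
        Version = version;
        this.members = members;
    }

    public long Version { get; }

    // Version-aware lookup: a silo that is missing from a snapshot newer than
    // the version at which the entry was recorded is reported as Dead.
    public SketchSiloStatus GetSiloStatus(string silo, long seenAtVersion)
    {
        if (members.TryGetValue(silo, out var status))
        {
            return status;
        }

        // Assumption: an unknown silo in a snapshot that is not newer stays
        // undetermined (None) instead of being declared Dead.
        return Version > seenAtVersion ? SketchSiloStatus.Dead : SketchSiloStatus.None;
    }

    // Dead-only filtering: ShuttingDown and Stopping entries remain valid.
    public bool IsEntryValid(string silo, long seenAtVersion)
    {
        return GetSiloStatus(silo, seenAtVersion) != SketchSiloStatus.Dead;
    }
}
```

Under these assumptions, an entry pointing at a `ShuttingDown` silo stays valid, an entry pointing at a `Dead` silo is invalid, and an entry for an unknown silo is invalid only when the snapshot version is greater than the entry's membership version.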
Stack:
Merge after #10090. This branch is stacked on split/pr10085-05-dead-silo-message-break; until earlier PRs merge, GitHub may show earlier stack changes. Incremental compare: ReubenBond/orleans@split/pr10085-05-dead-silo-message-break...split/pr10085-06-version-aware-status
Review focus:
Version-aware GetSiloStatus semantics, Dead-only filtering, and preserving ShuttingDown/Stopping registrations.